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Abstract 

Background: Current advances of the next-generation sequencing technology have revealed a large number of 
un-annotated RNA transcripts. Connparative study of the RNA structurome is an important approach to assess their 
biological functionalities. Due to the large sizes and abundance of the RNA transcripts, an efficient and accurate RNA 
structure-structure alignment algorithm is in urgent need to facilitate the comparative study. Despite the importance 
of the RNA secondary structure alignment problem, there are no computational tools available that provide high 
computational efficiency and accuracy. In this case, designing and implementing such an efficient and accurate RNA 
secondary structure alignment algorithm is highly desirable. 

Results: In this work, through incorporating the sparse dynamic programming technique, we implemented an 
algorithm that has an O(n^) expected time complexity, where n is the average number of base pairs in the RNA 
structures. This complexity, which can be shown assuming the polymer-zeta property, is confirmed by our 
experiments. The resulting new RNA secondary structure alignment tool is called ERA. Benchmark results indicate 
that ERA can significantly speedup RNA structure-structure alignments compared to other state-of-the-art RNA 
alignment tools, while maintaining high alignment accuracy. 

Conclusions: Using the sparse dynamic programming technique, we are able to develop a new RNA secondary 
structure alignment tool that is both efficient and accurate. We anticipate that the new alignment algorithm ERA will 
significantly promote comparative RNA structure studies. The program, ERA, is freely available at http://genome.ucf. 
edu/ERA. 



Background 

Non-coding RNAs (ncRNAs) have recently been recog- 
nized as important regulators of the biological systems 
[1,2]. They participate in the control of alternative splicing 
[3], gene transcription [4] and translation [5], and mRNA 
localization [6] . Most of the ncRNAs exert their biological 
functions by folding into specific structures, which makes 
the study of the RNA structurome a critical step towards 
complete understanding of the operational mechanism 
of the biological system [7] . Recently, genome-wide RNA 
structurome analysis has led to many interesting discover- 
ies regarding novel regulatory mechanisms. For example, 
analysis of the RNA structural elements in Drosophila 
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melanogaster 3'-UTR suggests a cluster of ncRNA ele- 
ments that can direct the localization of their upstream 
genes within the spermatids [8]. Similar studies have also 
been applied to the Ciona intestinalis genome for novel 
ncRNA family discovery [9]. With the finishing of the 
ENCODE [10] and modENCODE [11] projects, we expect 
that much more RNA transcripts will be experimen- 
tally identified. Many of these RNA transcripts may have 
exceptionally large sizes [12], and calls for more efficient 
computational tools to analyze their structures. 

As more RNA transcripts are being discovered, the 
experimental approaches for probing ncRNA struc- 
tures are also being revolutionized, allowing more 
accurate functional investigation through exploiting the 
structure-function relationship. Traditional RNA three- 
dimensional (3D) structure determination techniques 
such as X-ray crystallography, NMR and cryo-EM are 
expensive, making them inappropriate for genome-wide 
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survey of RNA structures. Currently, the emerging mas- 
sive parallel sequencing technology has been incorporated 
into the traditional chemical probing methods, making 
genome-wide experimental determination of RNA sec- 
ondary structures possible and with low cost. Available 
techniques in this category include PARS [13], FragSeq 
[14], and SHAPE-seq [15]. The RNA secondary structures 
determined by these techniques are much more accurate 
than those predicted by pure computational methods. For 
example, when coupled with SHAPE-seq data, the free 
energy minimization approach [16] is able to predict the 
secondary structure of a 16S rRNA with over 95% accu- 
racy [17]. In this case, the major purpose of this work 
is to develop an efficient and accurate RNA secondary 
structure alignment algorithm to facilitate genome- wide 
comparative studies of these RNA secondary structures. 

There are many existing algorithms that focus on the 
RNA secondary structure alignment problem [18-24]. 
RNA secondary structures can be represented as tree 
structures, and the edit-distance between the tree struc- 
tures can be used to represent their structural similarity 
[19]. Algorithms using such strategy are usually called 
tree editing algorithms. Using heavy path decomposition, 
Klein [25] improved the time complexity of the tree edit- 
ing algorithm to 0(Plogl), Recently, Demaine et al [26] 
further improved the time complexity to 0{1?) based on 
Klein's algorithm. However, Jiang et al [20] proposed to 
compute tree alignment distance for the comparison of 
trees. Algorithms that compute such a measure are called 
tree alignment algorithms. The tree alignment algorithm 
is a special case of the tree editing algorithm [27]. The tree 
alignment algorithm has been implemented into an RNA 
secondary structure alignment tool called RNAf ores ter 
[21]. Both the tree editing and tree alignment algorithms 
rely on tree representation of the RNA structure, and 
make sophisticated scoring functions difficult to imple- 
ment (such as the affine gap penalty for the loop regions). 
In addition, both tree editing and tree alignment algo- 
rithms do not treat base pairs as units of comparison, 
and make it difficult to implement a complete set of 
base-pair edit operations for RNA secondary structures 
editing (base-pair match, mismatch, breaking, altering, 
and removing; as defined by Jiang et al [24]). We demon- 
strate such a problem by showing a real example from 
the implementation of the widely-used RNA secondary 
structure alignment tool RNAf or ester [21]. 

Consider that the two RNA structures shown in 
Figure 1 (a) are being aligned as trees. In the first RNA 
structure, due to the insertion of a uracil (U), an additional 
base pair is predicted (dashed arc. Row 1). Both struc- 
tures are enclosed by G-C base pairs, and we focus on the 
alignment of their inner regions (boxed regions, Row 1). 
Following RNAf orester's extended tree representa- 
tion [21], the two RNA structures can be transformed 



into two trees (Row 2). The 'P' node represents a base 
pair formed between the two corresponding nucleotides. 
Because there is no base pair in the second structure, 
the only allowed operations are bond breaking and base- 
pair deletion (Row 3). For the bond breaking opera- 
tion, the base pair formed between A and U is broken, 
leaving them aligned to A and G in the second struc- 
ture, respectively (blue boxes, Row 3). The alignment 
between the U (first structure) and G (second structure) 
introduces an unnecessary mismatch, making the align- 
ment incorrect (blue boxes. Row 4). For the base-pair 
deletion operation, the entire base pair (including the 
two nucleotides A and U) is deleted (red box. Row 3). 
This operation opens two unnecessary gaps in the align- 
ment (red boxes. Row 4), making it underestimate the 
real structural similarity. On the other hand, we expect to 
handle the mis-predicted base pairs in a more straight- 
forward way. As shown in Figure 1 (b), we simply break 
the base pair interaction and disassociate the two cor- 
responding nucleotides completely (red cross. Row 2). 
These two nucleotides are then treated as regular 
unpaired nucleotides. We can use the standard sequence 
alignment algorithm [28] (with affine gap penalty for bet- 
ter alignment quality in the unpaired regions) to evaluate 
the pure sequence similarity between the boxed hairpin- 
loop regions (Row 3). The resulting alignment contains 
only one gap, and correctly interprets the true structural 
difference between the two RNA structures (red boxes, 
Row 4). 

The above example clearly shows the limitation of 
the implementation of the tree-based RNA secondary 
structure alignment algorithm RNAf or ester. Imple- 
menting the complete set of base-pair edit operations 
under the tree representation appears to be not a triv- 
ial task. Therefore, we propose to implement the general 
edit-distance alignment approach where all edit opera- 
tions can be implemented naturally. To guarantee that 
the implementation is as efficient as the Demaine et al/s 
algorithm (0(/^)), we incorporate the sparse DP tech- 
nique into a simultaneous alignment and folding (SAF) 
algorithm RNAscf [29] and restrict its input to fixed 
RNA secondary structures (recall that the general edit- 
distance alignment algorithm is a restricted case of the 
SAF algorithm). Using this technique, we can reduce the 
original time complexity by reducing a factor from n^ 
to Zt where n is the number of base pairs in the fixed 
RNA structures and n < z < n^. Under the assump- 
tion of the polymer-zeta property of RNA molecules [30], 
it is expected that z <^ n^ and even z e 0(n), In 
this case, the new general edit-distance RNA structure- 
structure alignment algorithm will have a time complexity 
of 0(zn^ + zP). The new time complexity has an expected 
cubic {z = 0(n) = 0(1)) growth behavior, and is the 
same as Demaime et al/s algorithm [26]. In addition, we 
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Figure 1 Comparison between the tree-based alignment approach and the general edit-distance alignment approach in handling 
mis-predicted base pairs, (a) The tree-based alignment algorithm in handling mis-predicted base pairs. Row 1 : The arcs on the sequences indicate 
the base pairs (solid arc indicates real base pairs, while dashed arc indicates mis-predicted base pairs). The structure regions indicated by the boxes 
are being aligned. Row 2: The two RNA structures are modeled into trees according to RNAf orester [21]. The 'P' node was introduced to 
represent a base pair. Row 3: Either the bond breaking or the base-pair deletion operation is taken. The blue boxes indicate the aligned nucleotides 
in the bond-breaking case. The red box indicates the base pair (including its nucleotides) being deleted in the base-pair deletion case. Row 4: The 
corresponding alignments resulted from both operations. The boxes in the alignments correspond to those in the RNA structure trees. Neither of 
the alignments is correct, (b) The general edit-distance alignment algorithm in handling mis-predicted base pairs. Row 1 : The same RNA structures 
are being aligned. Row 2: The base-pair interaction is deleted (red cross), leaving two free nucleotides. Row 3: The sequence similarity between the 
boxed regions is assessed using a traditional sequence alignment algorithm [28]. Row 4: The corresponding alignment is generated correctly. The 
boxes correspond to nucleotides that form the mis-predicted base pair. 



also devise a novel online pruning technique to further 
speedup the new algorithm, which deletes obsolete candi- 
dates on-the-fly. By combining both speedup techniques, 
the new RNA structure alignment algorithm is capable 
of comparing RNA secondary structures efficiently and 
accurately. 

We have implemented the proposed RNA structure 
alignment algorithm into a program called ERA (Efficient 
RNA Alignment). The benchmark results showed that 
ERA has the expected 0(zfi) time complexity. We showed 
the 0(zfi) time complexity of ERA through aligning Rfam 
[31] RNA structures that were carefully chosen to repre- 
sent a wide rage of input sizes. We also used a BraliBase 
II [32] benchmark to compare tools ERA, LocARNA and 
RNAf orester when aligning RNAs with known struc- 
tures. Nearly identical alignment quality can be observed 
for the general edit-distance alignment tools ERA and 
LocARNA, while both of them are more accurate than 
the tree alignment algorithm RNAf orester. Finally, we 
also concluded that ERA is efficiently implemented by 
observing an average of 10 fold speedup over LocARNA, 
and RNAf orester in terms of real RNA structure align- 
ments. Based on these results, we confirmed that the 



sparse DP technique and the online pruning technique 
are successfully incorporated into the original RNAscf 
algorithm. We also anticipate that ERA will become 
an important bioinformatics tool for comparative RNA 
structure analysis. 

Methods 

In this section, we will present a novel general edit- 
distance RNA structure alignment algorithm by incorpo- 
rating the sparse DP technique into the RNAscf algo- 
rithm. RNAscf was originally designed to identify the 
consensus structure between two RNA sequences. It 
guides the DP process though stacks and has a time com- 
plexity of 0(n^ + rP'P). Comparing to LocARNA (which 
has a time complexity of 0(l^-\-n^fi)), the indexing scheme 
used by RNAscf makes it easier to incorporate the sparse 
DP technique, which aims to reduce the size of n instead 
of /. In addition to the sparse DP technique, we will 
also present an online pruning technique, which tries to 
reduce the search space of the algorithm as the DP pro- 
ceeds. Through combining these two speedup techniques, 
the novel algorithm will have an expected 0(zfi) time 
complexity, where n < z <^ rP'. 
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The Methods section is organized as follows: In Section 
'Preliminaries and definitions', we will give the basic 
definition of RNA structures and the RNA alignment 
problem. In Section 'The original 0(n^ + rP'P') algorithm', 
we will reintroduce the RNAscf algorithm as a basis to 
understand the novel algorithm that is developed in this 
work. In Section 'Triangular inequality and optimal pair 
matchings', we will present the triangular inequality in 
RNA alignment with necessary proofs, which serves as 
a theoretical foundation for the sparse DP technique. In 
Section 'Detection of optimal pair matchings', we will fur- 
ther discuss the implementation details of incorporating 
the sparse DP technique. In Section 'A new algorithm with 
cubic time complexity', we will present the novel RNA 
alignment algorithm with the incorporation of the sparse 
DP technique. In Section 'Online pruning of optimal pair 
matchings', we will present the online pruning technique 
as an additional speedup step to the novel algorithm. 
Finally, in Section 'Pseudo-code', we will summarize the 
new algorithm using pseudo-code that can be directly 
implemented. 

Preliminaries and definitions 

We will begin with the introduction of the basic sym- 
bols and notations. The secondary structure of an RNA 
A of length Ia is represented by a set of base pairs in 
A, denoted as V^. A base pair p/^ e is an interac- 
tion formed between two nucleotides in the sequence of 
A, whose positions are denoted by l{p^) and r{p^) (with- 
out loss of generality, we assume l(p^) < r(p^)). The 
base pair can also be represented as (l(p^), r(p^)). The 
base pairs are partially ordered by the increasing order 
of their ending nucleotides, i.e. pf < if and only if 

r(pf) < r{p^). Since we do not consider RNA ensembles, 
no crossing base pair is allowed. That is, we do not allow 
lipf) < l{pf) < r(pf) < r(^). The two base pairs and 
Pj^ are either enclosing or juxtaposing to each other. The 
base pair encloses/?^ if l(p^) < l(pf) < r(pf) < ripf)y 
denoted diS pf <i pf^ The base pmr pf juxtaposes to and 
before p^ if r{pf) < l(p^), and is denoted by pf <j p^. 

We also define loop regions (i.e. hairpin loop, inter- 
nal/bulge loop, and multi-branch loop) whose sequence 
similarities are assessed by the alignment. The loop 
regions can be viewed as the unpaired regions in 
the RNA sequence that are segregated by the paired 
nucleotides. Let A[i . . ./] denote a continuous sequence 
region in RNA A, which begins with the iXh nucleotide 
and ends with the ;th nucleotide. Define L{p^) as the 
sequence A[l{p^) + l...r(/) - 1] (hairpin loop). If 
pf <i pfy define Li(pf,p^) as the sequence A[l(p^)-\- 
l...l(pf) - 1], and Lripf.pf) as the sequence 
A[r{pf) + l...r{p^) — 1] (internal or bulge loop). If 



pf <j p^, define L{pf,p^) as the sequence A[r(pf)-\- 

1 . . . l{p^) — 1] (multi-branch loop). 

The structure alignment between RNA A and B is the 
optimal matching between their base-pair sets and 
and the corresponding loop similarities. In other words, 
the alignment between RNAs A and 5 is a one-to-one 
binary relation A on the base-pair sets and V^, To 
ensure that the alignment will not lead to conflicting base- 
pair matchings, for any (pf.pf,) e A and (pf>pp e A, 
either/?^ </ p^ and pf, </ p^y or pf <j p^ and pf, <j p^,. 
Given the alignment A, the matched base pairs in A will 
partition the RNA sequences A and B into two sets of loop 
regions, Cf^ and Cf^, respectively. The sequence similarity 
between these two sets of loop regions is added to com- 
pute the overall alignment score. The optimal alignment is 
the relation A that maximizes overall alignment score M 
that combines both structure and sequence similarities: 

(1) 

Here, the first term is the summation of all structural 
similarities (Sstr) between the annotated base pairs. The 
structural similarity score for base-pair substitution is set 
using the RIBOSUM matrix [33], denoting such base- 
pair substitution matrix as R, We do not give penalty 
for base-pair deletion or insertion, as we may expect 
incorrectly predicted base pairs in the input RNA struc- 
tures. The second term is the summation of the sequence 
similarities (Sseq) on all loop (unpaired) regions that are 
determined by base-pair matchings in A, The sequence 
similarity between two sequence regions is computed as 
traditional sequence alignment, with D as a 4-by-4 matrix 
that accounts for nucleotide substitution (set using the 
RIBOSUM matrix), g as the gap opening penalty, and e 
as the gap extension penalty [34] {g and e are both set 
to negative values and g < e). The weights wi and W2 
are used to balance the structural and sequence contribu- 
tion to the overall alignment score, and we set wi > W2 
to emphasize structural similarity. To simplify the expres- 
sions, in the rest of this article, we assume that wi has been 
multiplied to all structural similarity terms (R), and W2 
has been multiplied to all sequence similarity terms (A g^ 
and e). 

We will now define the matrices that are used by the 
DP algorithm. Denote M[p^,p^] as the optimal structure 
alignment score between the regions enclosed by p^ and 
p^y given that p^ is matched with p^. Denote M^lp^^p^] as 
the optimal alignment score when the matching of p^ and 
p^ corresponds to a hairpin loop in the consensus struc- 
ture. Similarly, Milp^.p^] stores the optimal alignment 
score when the matching of p^ and p^ corresponds to an 
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internal, a bulge, or a multi loop in the consensus struc- 
ture. Assume that/7^ </ and pf, </ p^y Milp^^p^] 
can be computed by referring to the matrix Mc[pfypff]y 
which stores the optimal alignment score between the 
juxtaposed base-pair chains (each chain contains at least 
one base pair) that end WiXhpf and p^,, respectively. The 
optimal alignment between A and B can be retrieved from 
M\p^,Pq], where p^ and p^ are pseudo base pairs such 
t\v2Xp^ = (0, \A\-l).p^ = (0, 1^1-1), and = 0 
[29]. 

The original 0(n^ + n^l^) algorithm 

In this section, we briefly reintroduce the RNAscf [29] 
algorithm for RNA consensus structure prediction as a 
basis for understanding the novel algorithm developed 
in this work. The recursive functions for the RNAscf 
algorithm are outlined as follows: 



M/,[/,/] = Sstr(p^,p^)+Sseq(L(p^),L(p^)). 

Af/[/,/] = Sstrip^y) + max,-,,v [Mclpf.pf.] 

+ Sseq(Lr(pf.p^).Lr(pf.p^))]. 



(2) 
(3) 

(4) 



pfeTipf) 

pf^:F(p^) 



M[pf.pf^ + Sseq{Ll{pf,p^),Li{p^,,p^)), 
Mc[pf,pfA + M[pf,pf]+Sseq{L(pf,pflL(pf,pf% 

Mc[pf,pfA + G(\L(pf^,pf^)\ + \L(pf^)\), 
Mclpf,pf] + G(\L(pf.pf)\ + \L(pf)\). 

(5) 



In these recursive functions, ^5^^ denotes the structural 
similarity between two base pairs p^ and p^, Sseq denotes 
the sequence similarity between two unpaired regions, 
and G indicates the gap penalty for completely deleting 
the corresponding unpaired region. Note that G(|Z|) = 
g -\- \L\ ^ e if \L\ > 0, and G(|I|) = 0 otherwise. 
The base pair set J-(pf) contains all base pairs that are 
directly before and juxtaposed to pf. In other words, if 
p^ G J^ipf), then there is no such base pair/;^, such that 

p^ <j p^ <j pf' most real scenarios, \T\ is consid- 
ered as a constant [29,35]. This chaining technique based 
on the set enables us to handle the multi-loop case 
efficiently, by only considering | cases when computing 

Recall that the input RNA sequences have an average 
length of / and form an average of n base pairs. This 
algorithm can be computed with an expected time com- 
plexity of 0(n^ + rP'fi), To see the time complexity, first 



note that all sequence similarity scores that are referred in 
the recursive functions can be computed within 0{rP'P') 
time. Because all loop regions are segregated by base 
pairs, the number of loop regions is clearly bounded by 
0{n). Therefore, there are 0{rP') combinations of loop 
matchings, and computing each matching requires Oifi) 
time using a standard sequence alignment algorithm [34] . 
To this point, we assume all sequence similarities are com- 
puted using OirP'P') time, and are stored in a matrix for 
constant-time lookup. Now, observe that this algorithm 
computes the optimal alignment by filling up the DP table 
M, which contains 0(n^) values. Computing each value in 
the matrix M depends on the corresponding values of M/^, 
M/, and M^. The computation of values in matrix Mpi can 
be finished in a constant time due to the pre-computed 
sequence similarities. The computation of M/ requires 
0(n^) time, as determined by the necessity of travers- 
ing all possible combinations i and i' (see Equation 4). 
Finally, Mc can also be expected to be computed in a con- 
stant time, as \J^\ is assumed to be a constant. In this case, 
the computation of matrix Af requires 0(n^) time. Adding 
up the time required to pre-compute all sequence simi- 
larities of the loops, the overall time complexity for this 
algorithm thus becomes 0(n^ + n^P). 

Triangular inequality and optimal pair matchings 

The triangular inequality property servers as the theo- 
retical foundation for the sparse DP technique, which 
saves search space while maintaining the global optimal- 
ity. For computational RNA studies, this technique has 
been used in RNA folding [30], RNA consensus fold- 
ing (SAF) [36,37], as well as RNA-RNA interaction pre- 
diction [38] applications. In this work, our aim is to bring 
this technique into the RNA structure alignment applica- 
tion, where fixed RNA structures are considered instead 
of RNA structure ensembles. 

Consider the alignment between the RNA sec- 
ondary structures within the two regions A[i...j] and 
B[i' ...f] (see Figure 2 (a)). Denote M[i,j)i',f] as the 
optimal alignment score for such alignment. The trian- 
gular inequality can be summarized using the following 
inequality: 

M[/,7; i',f] > M [/, k) i', k'] + M[k + 1,/; k' + 1,/] , 

where i < k < j and i' < k' < /. This is because the par- 
titions of the regions A[i . . .J] and B[i^ . . .f] at positions 
k and k\ respectively, do not necessarily compatible with 
the optimal alignment. 

To simplify the expression of the triangular inequal- 
ity property, we define a number of pseudo base pairs to 
indicate specific regions of interest. A pseudo base pair 
is a void interaction, such that the structural similarity 
between any two pseudo base pairs is defined to be 0. For 
instance, let p and p^ be two arbitrary pseudo base pairs, 
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Figure 2 Illustration of the triangular inequality property, (a) Triangular inequality property of RNA secondary structure alignment. The 
horizontal lines indicate RNA sequences A and B. The dashed arcs are the pseudo base pairs added to the specific nucleotides, while the shaded 
areas define the correspondence between regions that are being aligned, (b) Alternative paths that go through either and p^, orp^ and p^/.The 
two shadings (dark and light gray) along the arcs represent the two alternative paths. 



we will have S strip ^p') = 0. The pseudo base pairs are 
only used for the sake of representational simplicity, and 
are not required for the implementation of the algorithm. 
Define a pseudo base pair p^ = (/,;) and a pseudo 
base pair p^ = (i^f). In this case, the optimal align- 
ment score between the regions A[i . . J] and B[i^ . . ./], 
i.e. can be rewritten as M[p^,p^], Similarly, 

define pseudo base pairs pf = (/,/:), p^ = (/c + 1,;), 
pf, = (i^k^), and p^^ = + 1,/) (see Figure 2 (a)). The 
triangular inequality can be simplified using the following 
observation: 

Observation 1. M\p^,p^] > M[pf,pf] -\-M[p^,p^,]. 

Using Observation 1, we can detect potential redun- 
dant computations in the original algorithm. Consider 
the structural configurations shown in Figure 2 (b), and 
assume that the base pairs p^ and p^ are being aligned at 
the current stage. Let p^ and p^ be arbitrary base pairs 
such that/?^ <i p^ <i pt' Note that/?^ may also repre- 
sent a pseudo base pair in order to consider an arbitrary 
sub region enclosed by p^. Define pseudo base pairs = 
{liptUip^) - = (l(P^)>Kpi) - = (ripj) + 

hr(p^))>pi = (r(p^) + hr(pi)).pt = (l(pi)J(pi) - 1), 
and/7^ = {r(p^) + l,r{p^)), Pseudo base pairs are also 
added to B symmetrically (see Figure 2 (b)). We can then 
prove Lemma 1 using Observation 1: 

Lemma 1. If 3 p^ and p^,, such that M[p^,p^,] 
+ M[p^y^,] ^M[p^y^,] > M[p^ y I then Mlp^^pf^] 



+ > M[p^yj +M[p^,p^] 

+ M[p^.pl], 

Proof, 

M[pi,pf^] +M[p^^.pl^] +M[pI.pI] 
>M[piyj+M[p^^y^.] 

^M[p^^.pl,] +M[pj.pl] ^M[p^,p^^^] 
> M\p^.pI^] +M[p^.p^] +M[p^.p^^] . 

□ 

The first inequality is a direct application of Observation 
1, and the second inequality is specified in the condition 
of Lemma 1. 

Because p^ and p^, are arbitrary base pairs. Lemma 1 
implies that the matching between p^ and p^ is guaran- 
teed to be suboptimal. That is, the overall alignment score, 
given that p^ matches with p^, is always lower than that 
when assuming they do not match (as the matching of p^ 
and p^ is conflicted with the matching of pf^ and pf,, as 
well as the matching of p^ and p^,). In this case, we can 
devise the DP algorithm to bypass the redundant refer- 
ences to the scenarios where p^ matches p^. Conversely, 
for the implementation of this idea, the DP algorithm will 
refer to the scenarios of matching p^ and p^ only when the 
condition specified in Lemma 1 is NOT satisfied. These 
necessary base-pair matchings are called the Optimal Pair 
Matchings (OPMs). If the matching of p^ and p^ is an 
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OPM, we denote this OPM as o Similarly, we repre- 
sent the OPM formed by base pairs pf and pf, as off. 
The new RNA alignment algorithm will maintain an OPM 
list O, which is modified online as the DP proceeds, so 
as to include newly identified OPMs and remove obsolete 
OPMs (which will be discussed in Section 'Online prun- 
ing of optimal pairmatchings'). If we assume that the RNA 
molecules have the polymer-zeta property [30], restrict- 
ing the search space of the DP using the OPM list O will 
reduce the time complexity of the RNA alignment algo- 
rithm to OizP) (as will be discussed in Section 'A new 
algorithm with cubic time complexity ). 

Detection of optimal pair matchings 

In the previous section, we have proved that Lemma 1 
can be used to detect the OPMs and save redundant com- 
putations. In this section, we will briefly discuss how it 
will be implemented. Lemma 1 states that if the align- 
ment score assuming p^ matches p^ {M[p^,p^]) is higher 
than the alignment score assuming p^ does not match p^, 
the matching between p^ and p^ is an OPM. Therefore, 
to detect the OPMs, we need to compute two alignment 
scores, i.e. the one when assuming p^ matches p^ and the 
one when assuming p^ does not match p^. 

Based on previous definition, the first alignment score 
is computed as In this case, we only need to 

compute the second alignment score. However, comput- 
ing the second alignment score (assuming p^ does not 
match p^) is difficult. Instead, we can compute the over- 
all alignment score without assuming any restrictions. 
Apparently, the overall alignment score includes both 
cases disregarding whether p^ matches with p^. There- 
fore, if M[p^,p^] is greater than or equal to such an 
overall optimal alignment, it is guaranteed to be greater 
than the alignment score when assuming p^ does not 
match p^y and ipso facto the matching of p^ and p^ is 
an OPM. 

Recall that the alignment score M[p^yp^] corresponds 
to the case where p^ matches with p^, and therefore it 
can be decomposed as the sum of two parts: the struc- 
ture similarity between the two base pairs themselves 
^strip^iP^)^ and the optimal alignment score between 
the regions A[l{p^) + l...r{p^) - 1] and B[l{p^) + 
1 . . . r{p^) — 1] without any restrictions. In this case, 
define two pseudo base pairs = {l(p^) — 1, 

r(p^) + l) and^ = (l(p^) -hr(p^) + 1), then M[p^,p^] 
can also be decomposed as the sum of two parts: 
Sstr(p^>P^)> and the optimal alignment score between 
the regions A[l(p^) . . . r(p^)] and B[l(p^) . . . r(p^)] with- 
out any restrictions. Note that p^ and p^ are both 
pseudo base pairs, and thus based on the definition, 
we have Sstrip^^P^) = 0. Therefore, M[p^,p^] is 
exactly the overall alignment score we need to detect 
the OPMs. 



In this case, based on Lemma 1, if M[p^,p^] > 
M[p^,p^], we will consider the matching of p^ and p^ as 
an OPM, and add the OPM o^'^ to the OPM list O. The 
overhead for detecting the OPM is that we need to dou- 
ble the computation for each combination of p^ and p^. 
However, such overhead will not raise the time complex- 
ity, and it is worthy as it will lead to a more significant 
speedup of the algorithm. In the following section, we will 
devise a new algorithm by assuming that the OPM list O 
is available. 

A new algorithm with cubic time complexity 

In this section, we introduce a new general edit-distance 
RNA structure alignment algorithm, which improves the 
original RNAscf algorithm based on Lemma 1 and has 
a time complexity of 0(z(«^ + fi)). Here, z is the size 
of the OPM list O, and we expect that z e 0{n) when 
assuming polymer-zeta property [30]. If we also assume 
0{n) = 0(1) (with fixed input RNA structures or effi- 
ciently pruned RNA structure ensembles), the overall time 
complexity of the new algorithm becomes 0(zfi). 

The new algorithm is developed based on the RNAscf 
algorithm [29]. Therefore, we adopt the same definition 
and notation as introduced in Section 'Preliminaries and 
definitions', as well as the similar recursive functions style 
used in Section 'The original 0(n^ + rP'fi) algorithm'. 
Because the computations of M[p^,p^] and M^[p^,p^] 
are boundary cases for the algorithm and are directly com- 
puted without referring to previous alignment results, the 
recursive functions for computing them are exactly the 
same as in the original algorithm: 

Mhlp^.p^] = Sstr(p^.p^)+Sseq(L(p^lL(p^))^ (7) 

The computation of Mi[p^,p^], on the other hand, 
refers to the previous alignment results that assumes pf 
matches pf, (see Equation 4). Using Lemma 1, it is clear to 
see that instead of traversing all combinations of pf and 
pf, we only need to consider the cases when the matching 
ofpf and pf is an OPM: 

Mi\p^y] = Sstr(p^.p^) 

+ max \Mclpf,pf.]+Sseq(Lr(pfy),Lr(pf^.p^))] . 

offeO^ ^ 

(8) 

Similarly, for the computation of Mdpfypf], we need 
to refer to the scenarios where pf matches pf and p^ 
matches p^,. The matching ofpf and pf is guaranteed to 
be an OPM, as ensured by Equation 8. Therefore, we only 
need to modify Equation 5 to ensure that the matching of 
pf and pf is an OPM: 
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),)' hi' 



M\pf,p^]+Sseq(Ll{pf,p^),Ll(pf,y)), 

Mc[pf,pf] +M\pf,pf,]+Sseq(L(pf,pf),L(pf,pf,)), 

Mc\pf,pf,]+Sseq(L(pf,pf),L(pf„pf,))+SseqiL(pf),L(pf,)). 



(9) 



Here, the set J^(off) contains ail OPMs that are directly 
before the OPM off. The set regarding the OPMs is 



number of nucleotide matchings and minimum number 
of gaps: 



defined as the follows. If an OPM off e T{c^f\ then ^ + min(|I(/)|, \L(p^)\) * dy, 



A,B 



either e T(pf) or e T(pf;). 

Recall that the time complexity of the original algo- 
rithm is 0(n^ + n^P). The first term 0(n^) results from 
OirP') computations by traversing all combinations of 
and p^ (see Equation 2) and 0(n^) time for computing 
Ml (see Equation 4). In the new algorithm, we introduce 
the OPM constraint to Equation 8 and Equation 9, and 
thus reduce the time complexity for computing Mi from 
0(n^) to 0(z). In this case, the first term 0(n^) of the 
original time complexity can be reduced to OizrP'). 

The second term 0{yp'P') in the original time complexity 
results from computing the sequence similarities between 
all loop regions. Note that all loop similarities required 
for computing Mi (Equation 8) and Mc (Equation 9) are 
associated with OPMs. For example, in Equation 8, all the 
loops are defined according to pf and pfy whose match- 
ing is expected to be an OPM. And in Equation 9, all the 
loops are defined according to pf and pf^ as well as pf 
and pf,, where both of these matchings are assumed to be 
OPMs. In this case, we do not need to compute loop sim- 
ilarities for all OirP') base-pair combinations, instead we 
only need to compute the loop similarities that are asso- 
ciated with the OPMs. In this case, the time complexity 
for computing the sequence similarities between all loops 
that are required by the computation of M/ and Mc can be 
finished in OizP) time. 

The only exception for the sequence similarity com- 
putation is the hairpin loop similarity Sseq(L(p^),L(p^)), 
which is required for computing Mpi (Equation 7). The 
computation of M/^ is not constrained by the OPM list, 
and therefore 0(n^P) time is still required. To resolve 
this issue, we observe that most of the RNA structure 
alignment algorithms emphasize the structure similarity 
other than sequence similarity (wi > W2 in Equation 
1). In this case, if there exist some base pairs within the 
regions enclosed by p^ and p^ to be matched, we can 
expect that > Mfi[p^,p^] in Equation 6. In this 

case, to avoid the unnecessary computation ofMpi[p^,p^], 
we can derive an upper bound Mk\p^,p^], which satis- 
fies >Mh[p^,p^] and can be estimated in unit 
time. Note that if Mi[p^,p^] > Mpj[p^,p^], we are sure 
that Mi\p^,p^]> Mpilp^.p^] by transition, and thus can 
save the computation of M^lp^.p^]. The upper bound 
^hlp^yf^] can be easily derived by assuming maximum 



(10) 



where d^ax is the highest score in the 4-by-4 nucleotide 
substitution matrix D, and / is a boolean variable that is 
set to 1 if \L(p^)\ ^ \L(p^)\ and set to 0 otherwise. For 
the computation of each M\p^,p^], we first estimate the 
upper bound Mpi in a unit time, and then compute 

Mi[p^,p^] in 0(z) time. By comparing these two values, 
we will determine whether the computation ofMpi\p^,p^] 
is necessary. The computation of Mpi[p^,p^] is only nec- 
essary when there are only a few base pair enclosed by p^ 
and p^ to be matched. Such condition implies the scenar- 
ios that either p^ or p^ is ^. real hairpin loop in the RNA 
structures, whose number is bounded by 0(n). Overall, 
the hairpin loop similarity matrix M^ can be computed 
in 0{nfi) time, and the overall time complexity of this 
algorithm is thus 0{z{rP' + fi)). 

Online pruning of optimal pair matchings 

In the previous sections, we have presented our 
approaches for detecting OPMs and building an OPM list 
O, as well as a more efficient algorithm that is developed 
based on O. Time complexity analysis of the algorithm 
claims that 0{z{n^ + fi)) time is sufficient for this new 
algorithm. The size of the OPM list O, i.e. z, thus becomes 
an important factor that determines the efficiency of the 
novel algorithm. Under the current algorithmic setup, as 
well as other similar works that implement a candidate list 
[30,37], z continuously grows as the algorithm proceeds. 
In this case, it is desirable to devise an online pruning 
technique, which can remove the obsolete OPMs from (9, 
and thus achieve further speedup of the algorithm. 

In this section, we will present such an online prun- 
ing technique to reduce the size of the OPM list O. The 
intuition of this online pruning technique comes from the 
following observation. The RNA structures are primar- 
ily stabilized by a number of helices, or perfectly stacked 
base pairs, lipf is perfectly stacked on pf, then l{pf) — 
l{pf) — 1, and r{pf) = r{pf) + 1. Consider the alignment 
between two helices, where each one of them contains 
m + 1 perfectly stacked base pairs. Assume that the 
first helix contains base pairs pf^ pf j^i^ • • • yff^m' 
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second helix contains base pairs pffipff_^ii • • • Based 
on Lemma 1, there will be at least m OPMs detected from 
such alignment, i.e. off,off^.^^^,...,off^.^^^^ Appar- 
ently, maintaining all these m OPMs is unnecessary, as 
these base pairs should be aligned together as two com- 
plete helices, rather than be aligned separately as two sets 
of individual base pairs. In this case, maintaining only 
one OPM, i.e. of^^ is sufficient to represent such an 
alignment. The other m OPMs become obsolete as soon as 
the OPM ofj^^^^ is detected, and can be removed from 

i-\-m,i -\-m 

the OPM list O to improve computational efficiency. In 
the following paragraphs, we will extend this idea to con- 
sider all situations in addition to the perfectly stacked sce- 
nario, as well as give formal description of this technique 
and related proofs. 

We will demonstrate the major idea of our novel online 
OPM pruning technique using Figure 2 (b). Imagine that 
at the current stage, M[p^,p^] has just been computed 
and (/^'^ has been identified as an OPM, where o^'^, is an 
arbitrary OPM that has been previously identified and is 
enclosed by o^'^ (p^ <j and p^, <j p^). Our aim is to 

estimate whether the detection of the OPM o"^'^ will make 
0^'^/ obsolete. Letp^ and p^, be arbitrary base pairs such 

that p^ <i p^ and p^ <j p^,. The regions enclosed hy p^ 
and p^, can be partitioned using at least one of the fol- 
lowing ways: M[p^,p^,] -\-M[p^,p^] -\-M[pf,p^,] (which 
is indicated by dark gray in Figure 2 (b)) and 
M\pf,pf;]+M\p^,p^^,]+M[p^,p^,] (which is indicated 
by light gray in Figure 2 (b)). If the corresponding score 
for the first path is higher than the second, M[p^,p^,] 
will not be referred to by any future matching between 
arbitrary base pairs and p^,, and thus making the OPM 
o"^'^, obsolete. In this case, the OPM o^'^, can be removed 
from O, 

We can summarize the criterion for removing o^'^, as 
an obsolete OPM using the following inequality: 

^\i4>p^'] +M\p^y] +M\p^y^^] > M[pf ,/^,] 

+ Af[p^,/^,]+M[/^,/^,], 
which can be rewritten as: 

M\p^y] -M[p^,p^^^] > mpt.pf^] -Mip^yj ) 
+ {M\p^yQ^]-M\p^y^^]\ 

To utilize such criterion, we need to have access to all val- 
ues included in the above inequality. However, we only 
know the values at the left hand side of the inequal- 
ity (Mlp^y] and M|^,/?^,]), while the other values at 
the right hand side are unknown. This is because the 
definitions of these pseudo base pairs are determined 



by p^ and p^n which are arbitrary base pairs that have 
not yet been computed by the DP algorithm. To solve 
this issue, observe that the score M[p^,p^,] — M[p^yp^,] 
is strongly related to the regions A[l(p^) . . . r(p^)] 
and B[l(pp...r(ppl and M[^,^^.] is 

strongly related to the regions A[l(pf)...r(pf)] and 
B[l(p^^) . . . r(pf,)]. Note that the regions . . . r(p^)] 

2ind A[l(pf) . . . r(pf)] can be determined when p^ and/?^ 
are known, which makes the estimation of their impact 
on future alignments possible (similarly for the regions 
B[l(pp . . . r(p^;)] and B[l(pf) . . . r(pp]). In this case, we 

can develop two upper bounds Up and Us> such that: 

Us>M[i4,p^,]-M[p^,p^,]. 

In this case, if M[t/,»*] -M\p^,p^,] > Ug + Us, we are 

sure that the criterion for characterizing o^'^, as an obso- 
lete OPM will be satisfied, and we will be able to remove 
0^'^, from O immediately. 

Now, we can discuss the details for setting up the upper 
bounds Up and Us. Because Up and Us are defined sym- 
metrically, we only discuss the computation of Up, Note 
that the upper bound Up needs to satisfy the condi- 
tion Up > M[p^y^]-M[p^,p^^]. Clearly, the differ- 
ence between — M[p^,p^;] directly comes from 
concatenating the region A[l(p^) . . . r(p^)] to the region 
A[l(p^) . . . r(p^)]y as well as concatenating the region 
B[iy^^)...r(pp] to the region B[l(p^^) . . . r(p^^)l The 
best case scenario for such an operation, is to assume that 
the concatenation of the regions A[l(p^) . . . r(p^)] and 

^Uip^f) • • • ^(P^')] will result in as many new base-pair and 
nucleotide matches as possible. 

Assume that there are base pairs that are anno- 
tated in the region A[l(p^) . . . r(p^)], and m^, base 

pairs that are annotated in the region B[l(p^,) . . . r(p^,)]. 
Also assume the maximum base-pair substitution score 
in the RIBOSUM matrix R is r^ax' By concatenat- 
ing the regions A[l(p^) . . . r(p^)] and B[iy^,) . . . r(/7^,)], 

we introduce at most max(m^,m^,) more base-pair 

matchings to the alignment indicated by M[p^,p^,], 
This implies the maximum structure alignment score 
increment of max(m^,m^/) * rmax- Similarly, at most 
m3.x(\L(p^)\,\L(p^,\)) more nucleotide matches, or gap 
fill-ups, are possible, compared to the existing align- 
ment indicated by the score M[p^,p^,], The cor- 
responding alignment score for such case is thus: 
max(|I(^)|,|I(^^,|)) * {dmax - g - e). To explicitly 
represent the upper bound using only the identified 
OPMs, we rename Up as ^/[o^;^„o^'^] (similarly, we 
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rename Us as &^[o^;^„ o^'^]). Therefore, &/[o^;^„ o^'^] 
can be computed using the following 

equations: 

i//[o^;^,.o^"^] = max(m^, w^,) * w + max(\L(p^)\, 
\L(Pp')\) * (dmax - g - e), 

t/,[o^J,o^-^] = maximf,mf) * r^ax + max(|LO;^)|, 
\L(p^')\)*(dmax-g-e). 

(11) 

With the upper bounds i/;[o^;^„ and ^/r[o^;^/, 
0^-^], we are able to formally prove the correctness of the 
online OPM pruning technique: 

Lemma 2. If M\p^y]-M\p^,pl,] > i/;[oJ^„ o^-^] 
+ ^^.[o^J.o^'^], where £^/[o^J,o^'^] > M[p^,p^^,] 
-Mlp^yj andLT.lo^;^,,./'^] >M[p^y,,] -M\p^.p^,]. 

+ M\p^,pf,]+M\}4,4]. 
Proof. 

> Mlp^^.p^,] + Ui[o^^i„o^'^] + Ur[o^^i„ 0^^] 

+ m\p^,pI,]+m\p^,pI] 
+m\p^,pI,]+m\p^,pI,]. 

□ 

As a result, when the condition given in Lemma 2 is 

A B 

satisfied, the enclosed OPM o^'^, can be readily removed. 
Pseudo-code 

The pseudo-code for the new RNA secondary structure 
alignment algorithm that implements both speedup tech- 
niques is summarized in Figure 3. 

Results 

We implemented the proposed general edit-distance RNA 
structural alignment algorithm into a program called 
ERA (Efficient RNA Alignment) using GNU C++. In this 
section, we will show that (1) ERA has the expected 0(zP) 
time complexity; (2) ERA is as accurate as the other state- 
of-the-art RNA alignment tools; and (3) ERA runs much 
faster than the other RNA alignment tools. In addition to 
these goals, we have also benchmarked ERA to demon- 
strate its 0(fi) space complexity. For details regarding the 



space complexity issues please refer to the Additional file 
1: Section SI (also see Figure SI, Figure S2, and Table SI). 

We benchmarked the ERA with two other state-of-the- 
art RNA alignment tools: LocARNA as a representative of 
the general edit-distance RNA structure alignment algo- 
rithms and RNAf ores t er as a representative of the tree- 
based RNA structure alignment algorithms. Note that 
although LocARNA is developed to compare RNA struc- 
ture ensembles, its flexible parameter setup makes it easy 
to prune its input RNA ensembles (see Section 'Running 
LocARNA' for more details). However, the readers should 
note that LocARNA is used in a restricted case for fair 
comparison with ERA, and more potential applications 
of LocARNA should be recognized. We do not compare 
ERA with its predecessor RNAscf, because RNAscf is 
implemented to find consensus helical configurations that 
do not include individual base pairs [29]. Both LocARNA 
and RNAforester were invoked using their default 
parameters. 

Running LocARNA 

Note that LocARNA was originally developed to com- 
pare two RNA structure ensembles [39]. Due to the recent 
technical advances in experimental RNA structure prob- 
ing, we anticipate that RNA structures can be predicted 
with much higher accuracy. Therefore, we develop ERA to 
compare two fixed RNA structures. In this case, we need 
to prune the original inputs of LocARNA, so as to ensure 
that they only represent the fixed structures rather than 
any additional information. 

The input RNA ensembles for LocARNA are repre- 
sented using the base-pairing probability matrices, which 
can be computed using the McCaskills algorithm [40,41]. 
In a base-pairing probability matrix, each base pair (pos- 
sibly crossing) is assigned with a probability to indicate 
its thermodynamic stability. Our goal is to prune such 
a base-pair probability matrix, such that it only contains 
information regarding the fixed RNA structure (in our 
experiment, we take the Rfam [31] annotation or the 
BraliBase II [32] annotation as the fixed structure for an 
RNA sequence). For each base pair in the matrix, if it is not 
presented in the annotated structure, its corresponding 
probability is reset to 0. On the other hand, if it is included 
in the annotated structure, its probability is reset to 1. 
In this case, the pruned base-pairing probability matrix 
contains only the information regarding the fixed RNA 
structure. We show an original and a pruned base-pairing 
probability matrix in Additional file 1: Figure S3 as an 
example. All LocARNA inputs for experiments mentioned 
in this article are preprocessed using this strategy. 

Time complexity 

In this section, we expect to show that the proposed spar- 
sification is successfully implemented, and ERA has the 
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Algorithm 1 Pseudo-code for the new 0{zP) algorithm 

Order base pairs in A by their ending nucleotides; Order base pairs in B by their ending nu- 
cleotides; Initiahze the 0PM list (9^0; 
for i = 1 to IV^I do 

^ ith base pair in A 
for j = 1 to \V^\ do 

^ jih base pair in B; Compute M/[^'^,p^]; 
Estimate M/,[p^,p^] with Equation 10, M/,[p^,p^] ^ M/,[p^,p^]; 
if Mhip'^.p^] > M/[j9^,j9^] then 

Compute M/,[p^,p^]; 
end if 

M[j9^,j9^] ^ max{Mi[p^,p^],Mh[p'^,p^]); Compute M[p^,p^]; 
if M[p^,p^] > M[p^,p^] then 

Identify the matching of p^ and p^ as an 0PM o^'^; 

for each 0PM of '^^ G O do 

Estimate Ui[o^'^ ^ o"^'^] and C/^[of o^'^] with Equation 11; 

if M[p^,p^] > Ui[ot§.o^^^] + Ur[ot§,o^^^] + M[pf ,pf,] and There exists no base 

pair between of and o^'^ then 

Remove o^'^, from the 0PM list O; 
end if 
end for 

Add o^'^ to the 0PM list O; 
end if 
end for 
end for 

Figure 3 Pseudo-code for the implementation of tlie speedup techniques. 



expected 0{zfi) time complexity. To show the 0{zfi) time 
complexity, we chose a number of RNA families from 
Rfam that have a wide range of sequence lengths. We then 
randomly selected two individual RNA structures from 
each family (see Additional file 1: Table S2) to run ERA 
alignment. The running time for their alignments, versus 
(note that n < / for annotated structures and 0{n) = 
0(1)), is plotted in Figure 4 (a). We can clearly observe 
the expected 0{zP') time complexity from the figure. In 
addition, we are also able to show that the speedup ratio, 
when comparing to the 0{l^ -\- rP'P') LocARNA algorithm, 
is strongly correlated with the efficiency of pair match- 
ing reduction due to the sparse DP technique (the ratio 
n^/z, see Figure 4 (b)). The relatively large deviations 
are observed for biocoid_3UTR and snR86 RNA struc- 
tures. This is because they contain a large number of base 
pairs and have a high base pair to sequence length ratio. 
In this case, the overhead for maintaining the OPM list 
becomes apparent and makes the speedup less significant. 
In summary, we have shown that the sparse DP technique 



is successfully implemented, ERA has an expected time 
complexity of 0{zfi), 

Alignment quality 

In addition to time complexity improvement, we also 
expect to show that ERA is as accurate as the other state- 
of-the-art general edit-distance RNA structure alignment 
tools. We used BraliBase II [32] as the reference data 
set, and used its corresponding structure annotations 
as the fixed input structures. We adopted two mea- 
sures to indicate the alignment quality, i.e., the Sum- 
of-Pair Score (SPS) [32] and the Structure Conservation 
Index (SCI) [42]. The benchmark results are shown in 
Figure 5. The alignment qualities of ERA and LocARNA 
are nearly identical, since incorporating the sparse DP 
technique will not compromise global optimality. The 
benchmark results also show that ERA and LocARNA can 
produce more accurate alignments when compared to 
RNAf orester. This is because ERA and the restricted 
version of LocARNA are both general edit-distance RNA 
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Figure 4 Time complexity and OPM reduction of ERA. (a) Running time versus rr' , wliere n is tine average number of base pairs in ti~ie RNA 
structures, (b) OPM reduction ratio versus running time speedup ratio. Tiie OPM reduction ratio is computed by n^/z, wiierezis tiie number of OPMs. 



alignment algorithms that are capable of flexibly handling 
incorrectly predicted base-pairs, while RNAf orester is 
more sensitive to such errors, since it implements tree 
alignment. 

Running time speedup 

Finally, after benchmarking the time complexity and align- 
ment accuracy of ERA, we also expect to show that ERA is 
an efficient implementation and can run faster than other 
state-of-the-art RNA alignment tools. We compared the 



real running time of ERA, LocARNA, and RNAf orester 
on the selected RNA structures from Rfam. The bench- 
mark results are summarized in Table 1. We can observe 
that ERA is capable of speeding up LocARNA by a min- 
imum of 5.2 fold and a maximum of 91.5 fold. ERA can 
also speedup RNAf orester by a minimum of 2.8 fold 
and a maximum of 242.6 fold, with only one exception in 
which RNAf orester is 9.6 times faster than ERA. This is 
because the RNA structures being aligned (snR86) contain 
only one stem-loop structure; and in such a special case. 
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Table 1 Comparison on running time of era, LocARNA, and RNAforester 



RNA family 


I6n9th 


num. 


ERA 


LocARNA 


ERA VS. 


RNAf 0]r6S t6]r 


ERA VS. 




(bp) 


pairs 


(sec) 


(sec) 


LocARNA 

(fold) 


(sec) 


RNAf oiTGS t6]r 

(fold) 


tRNA 


78 


21 


0.017 


0.100 


5.882 


0.047 


2.765 


Gly riboswitch 


105 


22 


0.015 


0.277 


18.46 


0.162 


10.80 


U12 spliceosome 


160 


42 


0.035 


0.311 


8.886 


0.657 


18.77 


Phage_pRNA 


244 


43 


0.124 


0.647 


5.218 


6.935 


55.93 


tmRNA 


367 


64 


0.929 


22.45 


24.16 


2254 


242.6 


biocoid_3UTR 


549 


155 


4.898 


170.3 


34.77 


13.99 


2.856 


snR86 


1004 


333 


53.15 


4862 


91.48 


5.579 


-9.527* 


Sacc_telomerase 


1162 


181 


23.93 


522.3 


21.82 


3697 


154.5 



ERA is slower than RNAforester when aligning snR86 RNA structures. 



the time complexity of RNAforester becomes 0(/^) 
[21]. 

To further investigate the real running time speedup 
of ERA on randomly selected RNA structures, we com- 
piled a much larger data set that contains 1,000 pairs 
of randomly selected RNA structures from Rfam. The 
benchmark results on this large data set are summa- 
rized in Figure 6. In Figure 6, we can see that ERA (blue 
triangle) runs much faster than LocARNA (red cross) 
and RNAforester (green star). In addition, we can 
also observe that the running time of ERA grows slower 
than those of LocARNA and RNAforester, which fur- 
ther confirms our previous time complexity analysis (see 
Figure 4 (a)). This speedup is significant, and renders 
ERA with the power of aligning long ncRNAs that are 



revealed by recent research advances. In summary, ERA 
is an efficient and accurate RNA structure alignment tool 
as compared to its state-of-the-art counterparts LocARNA 
and RNAforester. 



Discussion and conclusions 

In this article, we have presented a novel algorithm for 
efficient alignment of RNA secondary structures by incor- 
porating the sparse DP technique. The major theoretical 
contribution of this work lies in two parts. First, to our 
knowledge, this is the first application of the sparse DP 
technique to RNA structure-structure alignment. Sec- 
ond, the novel online OPM pruning technique can pro- 
vide insights for future algorithm designs that need to 



A ERA running time 

+ LocARNA running time 

* RNAforester running time 




15 2 
RNA size (I * n) 

Figure 6 Computational efficiency comparison between era, LocARNA and RNAforester on aligning randomly selected RNA 
structures from Rfam. The running time for ERA (blue triangles), LocARNA (red crosses) and RNAforester (green stars) on aligning 1 ,000 pairs 
of randomly selected RNA structures from the Rfam database. The x-axis corresponds to the average sizes of the RNA structures being aligned, 
which is computed as the product of their average length (/) and their average number of base pairs (n). The y-axis corresponds to the actual 
running time in the unit of second. We can see that ERA is significantly faster than the other two tools. 
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maintain a candidate list. The implementation of this 
novel algorithm is a tool called ERA, which can run in 
Oizfi) time and 0(/^). Such time and space complex- 
ity make ERA one of the most efficient RNA structure 
alignment tools that are currently available. 

The online OPM pruning technique is newly devel- 
oped from this work, which aims at deleting obsolete 
candidates as the DP proceeds. Although this technique 
cannot improve the computational complexity, it is effi- 
cient in reducing the real running time. In Additional 
file 1: Table S3, we summarized the running time of ERA 
in aligning individual RNA structures, with or without 
the online OPM pruning technique. We observed that by 
incorporating this technique, the running time of ERA 
was reduced by an average of 2.3 fold. Meanwhile, the 
speedup ratio is highly uniform (with 1.7 fold as the 
lowest and 3.1 fold as the highest) across RNA struc- 
tures with different sizes, meaning that it reduces running 
time by a constant factor. The online OPM pruning tech- 
nique can also be modified and incorporated into other 
related algorithms that implement the candidate list, such 
as the sparse DP algorithms for RNA folding [30], RNA 
consensus folding [36,37], and RNA-RNA interaction 
[38]. 

The speedup of ERA is most significant when the num- 
ber of base pairs in the RNA structures is small. This is 
because the algorithm is indexed by base pairs and has a 
time complexity of 0{z{n^ + /^)). As n increases, the term 
OizrP') will dominate the overall time complexity. In this 
case, an ideal application of ERA is to align fixed RNA 
structures, because it guarantees that n < L Note that as 
a sparsified version of the SAF algorithm RNAscf [29], 
the new algorithm developed here is also capable of han- 
dling RNA structure ensemble alignments. However, we 
do not implement this feature into ERA, because one can- 
not guarantee n < I for RNA ensemble alignments. This 
would make the speedup of ERA less significant. Besides, 
there are other alternative tools [36,37] available for such 
a purpose. 

With the completion of the ENCODE [10] and mod- 
ENCODE [11] projects, more and more RNA tran- 
scripts will be experimentally revealed. At the same 
time, with the advance of high- throughput RNA struc- 
ture probing techniques [13-15], the secondary struc- 
tures of these RNA transcripts will also be predicted 
with a much higher accuracy. In this case, ERA, which 
can compare fixed RNA structure efficiently and accu- 
rately, becomes an ideal computational tool to evalu- 
ate the structural similarities of these RNA transcripts. 
ERA can be used to perform all-against-all alignments on 
these RNA transcripts, which will then be subsequently 
summarized as the distance matrix for clustering pur- 
poses. Various clustering algorithms [8,39] can then be 
applied to identify ncRNA families with similar secondary 



structures and infer their amazing cellular and molecular 
functionalities. 

Additional file 



Additional file 1 : Supplementary information. This file contains four 
sections. In Section SI, we briefly discuss the space issue of era and 
provide related experimental results. In Section S2, we document the 
randomly selected RNA structures used for experiments mentioned in the 
main article. In Section S3, we evaluate the impact of the online OPM 
pruning technique in speeding up ERA. In Section S4, we give examples of 
the pruned base-pairing probability matrix for executing LocARNA. 
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